We’ll start by making some histograms.
#install.packages("dslabs")
library(dslabs)
data(heights)
glimpse(heights)
## Rows: 1,050
## Columns: 2
## $ sex <fct> Male, Male, Male, Male, Male, Female, Female, Female, Female, M…
## $ height <dbl> 75, 70, 68, 74, 61, 65, 66, 62, 66, 67, 72, 72, 69, 68, 69, 66,…
This data set contains self-reported heights of people, in inches, along with their sex.
Use ggplot to make a histogram of all of the heights:
ggplot(heights, aes(x = height)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Change up the binwidth and see how the plots change. Try 1, 5, 10, and 20.
ggplot(heights, aes(x = height)) + geom_histogram(binwidth = 1)
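If editing the binwidth by hand gets tedious, a loop along these lines may help compare several widths. This is only a sketch: it uses a simulated sample (an assumption, standing in for heights) so that it runs on its own, and the names sim_heights, widths, and plots are made up for illustration.

```r
library(ggplot2)

# Simulated stand-in for the heights data (an assumption, not the real data)
set.seed(1)
sim_heights <- data.frame(height = rnorm(500, mean = 68, sd = 4))

# One histogram per bin width suggested in the text
widths <- c(1, 5, 10, 20)
plots <- lapply(widths, function(w) {
  ggplot(sim_heights, aes(x = height)) +
    geom_histogram(binwidth = w) +
    labs(title = paste("binwidth =", w))
})

# Draw them one after another
for (p in plots) print(p)
```

Swapping sim_heights for heights should give the same comparison on the real data.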
Smooth this out to an empirical density with
geom_density()
ggplot(heights, aes(x = height)) + geom_density()
Add a new argument inside aes(), group = sex, to
split this density by sex
ggplot(heights, aes(x = height, group = sex)) + geom_density()
Or we can do it with color or fill. If you
map color to sex, ggplot knows that you want a different curve
for each group.
ggplot(heights, aes(x = height, color = sex)) + geom_density()
If you use fill instead of color, there is a slight issue: the filled curves overlap and hide each other. We can fix this with alpha transparency!
ggplot(heights, aes(x = height, fill = sex)) + geom_density(alpha = .3)
Let’s make some boxplots of the same information.
ggplot(heights, aes(x = height, y = sex, fill = sex)) + geom_boxplot()
Find the mean and median overall.
heights %>% summarise(overallmean = mean(height), overallmedian = median(height))
## overallmean overallmedian
## 1 68.32301 68.5
Find the mean and median for both groups.
heights %>% group_by(sex) %>% summarise(mean = mean(height), median = median(height))
## # A tibble: 2 × 3
## sex mean median
## <fct> <dbl> <dbl>
## 1 Female 64.9 65.0
## 2 Male 69.3 69
How tall is the tallest woman? How short is the shortest man?
heights %>% group_by(sex) %>% summarise(tallest = max(height), shortest = min(height))
## # A tibble: 2 × 3
## sex tallest shortest
## <fct> <dbl> <dbl>
## 1 Female 79 51
## 2 Male 82.7 50
# install.packages("pscl")
library(pscl) # loads in the package that has this data.
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
## You might need to install this...
# data for presidential elections
votedata <- presidentialElections
glimpse(votedata)
## Rows: 1,097
## Columns: 4
## $ state <chr> "Alabama", "Arizona", "Arkansas", "California", "Colorado", "C…
## $ demVote <dbl> 84.76, 67.03, 86.27, 58.41, 54.81, 47.40, 48.11, 74.49, 91.60,…
## $ year <int> 1932, 1932, 1932, 1932, 1932, 1932, 1932, 1932, 1932, 1932, 19…
## $ south <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FAL…
Let’s look at the Democratic vote by state for 2000. We can’t use
geom_bar for a bar chart, since geom_bar counts rows and we already have
the category in one variable and the “height” of the bar in another. We need
geom_col()
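The difference between the two geoms can be seen on a tiny example. The data frame toy and its columns are hypothetical, made up only for this sketch.

```r
library(ggplot2)

# Hypothetical toy data: one row per category, bar height already computed
toy <- data.frame(category = c("a", "b", "c"), value = c(10, 25, 5))

# geom_bar() counts rows per category, so every bar here has height 1
p_bar <- ggplot(toy, aes(x = category)) + geom_bar()

# geom_col() takes the bar height from the y aesthetic
p_col <- ggplot(toy, aes(x = category, y = value)) + geom_col()

p_bar
p_col
```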
Make a bar graph of the democratic vote by state in 2000.
votedata %>% filter(year == 2000) %>% ggplot(aes(x = state, y = demVote)) + geom_col()
Well this looks awful. We have two options: swap the x and y or the more fun sounding… Coordinate flip!
Use coord_flip() on the previous graph to make it
better.
votedata %>% filter(year == 2000) %>% ggplot(aes(x = state, y = demVote)) + geom_col() + coord_flip()
I don’t love the squashed together coordinates, but it’s a display window issue.
So. This is a helpful graph, but it would be more helpful if
it was ordered. Use x = reorder(x_variable, y_variable) in
aes() to order the x variable by the y variable
votedata %>% filter(year == 2000) %>% ggplot(aes(x = reorder(state, demVote), y = demVote)) + geom_col() + coord_flip()
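To see what reorder() does on its own, here is a small sketch with made-up vectors (state and vote are illustrative, not the real data). It returns a factor whose levels are sorted by the second argument, and ggplot draws categories in level order.

```r
# Toy vectors (made up for illustration)
state <- c("a", "b", "c")
vote  <- c(30, 10, 20)

# Levels are sorted by increasing vote
ordered_state <- reorder(state, vote)
levels(ordered_state)
# "b" "c" "a"

# A minus sign reverses the order, largest first
reversed_state <- reorder(state, -vote)
levels(reversed_state)
# "a" "c" "b"
```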
So, what if I want to see what the north and south states did differently?
Start with a facet_wrap using the south variable:
votedata %>% filter(year == 2000) %>% ggplot(aes(x = reorder(state, demVote), y = demVote)) + geom_col() + coord_flip() + facet_wrap(vars(as.factor(south)))
Okay, that’s not great. Let’s color by south instead.
votedata %>% filter(year == 2000) %>% ggplot(aes(x = reorder(state, demVote), y = demVote, fill = south)) + geom_col() + coord_flip() + scale_fill_manual(values = c("FALSE" = "#94b0da", "TRUE" = "#d7263d"))
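I’m a good data scientist, so I want my plot to have a name and my
axes to have labels! Use labs to add a title, subtitle, and
x and y labels. Note that labs() names the aesthetics, not the screen
position: after coord_flip(), x is still state and y is still demVote.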
votedata %>% filter(year == 2000) %>% ggplot(aes(x = reorder(state, demVote), y = demVote, fill = south)) +
geom_col() + coord_flip() +
scale_fill_manual(
name = "Region", # Legend title
values = c("FALSE" = "#94b0da", "TRUE" = "#d7263d"),
labels = c("FALSE" = "Non-South", "TRUE" = "South")) +
labs(title = "Percentage of Vote Won by Democratic Candidate", subtitle = "US Presidential Race of 2000", x = "State", y = "Democratic Vote (%)")
You can move the legend with
theme(legend.position = "bottom")
votedata %>% filter(year == 2000) %>% ggplot(aes(x = reorder(state, demVote), y = demVote, fill = south)) +
geom_col() + coord_flip() +
scale_fill_manual(
name = "Region", # Legend title
values = c("FALSE" = "#94b0da", "TRUE" = "#d7263d"),
labels = c("FALSE" = "Non-South", "TRUE" = "South")) +
labs(title = "Percentage of Vote Won by Democratic Candidate", subtitle = "US Presidential Race of 2000", x = "State", y = "Democratic Vote (%)") + theme(legend.position = "bottom")
What else could we facet by? Years! Let’s filter to 2008 and 2016, then facet by year.
votedata %>% filter(year %in% c(2008, 2016)) %>% ggplot(aes(x = reorder(state, demVote), y = demVote, fill = south)) +
geom_col() + coord_flip() +
scale_fill_manual(
name = "Region", # Legend title
values = c("FALSE" = "#94b0da", "TRUE" = "#d7263d"),
labels = c("FALSE" = "Non-South", "TRUE" = "South")) +
labs(title = "Percentage of Vote Won by Democratic Candidate", subtitle = "US Presidential Race", x = "State", y = "Democratic Vote (%)") + theme(legend.position = "bottom") + facet_wrap(~ year)
We need to know who won! We can add a line at 50 to mark a majority
of the vote. The layer geom_hline() adds a horizontal line at a given y
value. Because of coord_flip(), y (demVote) runs left to right here, so
the line shows up as vertical. (What do you guess geom_vline() would do?)
votedata %>% filter(year %in% c(2008, 2016)) %>% ggplot(aes(x = reorder(state, demVote), y = demVote, fill = south)) +
geom_col() + coord_flip() +
scale_fill_manual(
name = "Region", # Legend title
values = c("FALSE" = "#94b0da", "TRUE" = "#d7263d"),
labels = c("FALSE" = "Non-South", "TRUE" = "South")) +
labs(title = "Percentage of Vote Won by Democratic Candidate", subtitle = "US Presidential Race", x = "State", y = "Democratic Vote (%)") + theme(legend.position = "bottom") +
facet_wrap(~ year) +
geom_hline(yintercept = 50)
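The interaction between geom_hline() and coord_flip() can be confusing, so here is a small sketch of both routes to the same picture. The toy_votes data frame is hypothetical. The second version relies on geom_col() accepting the category on the y aesthetic, which should work in recent ggplot2 releases (3.3.0 or later, as far as I know).

```r
library(ggplot2)

# Hypothetical toy data standing in for one year of votedata
toy_votes <- data.frame(
  state   = c("a", "b", "c"),
  demVote = c(42, 55, 61)
)

# With coord_flip(): the line is declared on the y aesthetic (demVote),
# and the flip turns it into a vertical line on screen
p_flip <- ggplot(toy_votes, aes(x = state, y = demVote)) +
  geom_col() +
  geom_hline(yintercept = 50) +
  coord_flip()

# Without coord_flip(): map demVote to x directly and use geom_vline()
p_direct <- ggplot(toy_votes, aes(x = demVote, y = state)) +
  geom_col() +
  geom_vline(xintercept = 50)

p_flip
p_direct
```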
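When using geom_polygon or geom_map, you will typically need two data frames:
one with the coordinates of each polygon (here, states_map) and one with the
values associated with each polygon (here, votedata).
An id variable links the two together.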
Run the code below to draw a map.
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
votedata$state <- tolower(votedata$state) ## states need to be lowercase for linking
states_map <- map_data("state") ## this gives us the lat and long for each point of each state.
map_plot <- ggplot(data = votedata %>% filter(year == 2008), aes(map_id = state)) +
geom_map(aes(fill = demVote), map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat)
map_plot
map_plot <- ggplot(data = votedata %>% filter(year == 2016), aes(map_id = state)) +
geom_map(aes(fill = demVote), map = states_map)+
expand_limits(x = states_map$long, y = states_map$lat)
map_plot
What if I want a map that shows which of the states are “south”? What do I change?
map_plot <- ggplot(data = votedata %>% filter(year == 2008), aes(map_id = state)) +
geom_map(aes(fill = south), map = states_map) +
expand_limits(x = states_map$long, y = states_map$lat)
map_plot
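The two-data-frame pattern behind these maps can be seen in miniature below. This is a sketch with made-up squares rather than real state outlines; outlines, values, and share are hypothetical names. geom_map() expects the map data frame to have x (or long), y (or lat), and id (or region) columns.

```r
library(ggplot2)

# Data frame 1: polygon outlines, one row per corner (toy squares, not real states)
outlines <- data.frame(
  id = rep(c("left", "right"), each = 4),
  x  = c(0, 1, 1, 0,  2, 3, 3, 2),
  y  = c(0, 0, 1, 1,  0, 0, 1, 1)
)

# Data frame 2: one row per polygon with the value to display
values <- data.frame(id = c("left", "right"), share = c(40, 60))

# map_id in `values` is matched against id in `outlines`
p_map <- ggplot(values, aes(map_id = id)) +
  geom_map(aes(fill = share), map = outlines) +
  expand_limits(x = outlines$x, y = outlines$y)

p_map
```

In the election maps, votedata plays the role of values and states_map plays the role of outlines, with the lowercase state name as the linking id.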
I want to know the average Democratic vote for North vs. South, by year.
First, find the average Democratic vote for the north and the south
in every year. You’ll need to group by two variables here, and you can
do it in a single group_by() call.
votedata %>% group_by(year, south) %>% summarise(averagevotes = mean(demVote))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 44 × 3
## # Groups: year [22]
## year south averagevotes
## <int> <lgl> <dbl>
## 1 1932 FALSE 56.7
## 2 1932 TRUE 83.4
## 3 1936 FALSE 59.2
## 4 1936 TRUE 83.2
## 5 1940 FALSE 52.8
## 6 1940 TRUE 80.9
## 7 1944 FALSE 51.1
## 8 1944 TRUE 75.1
## 9 1948 FALSE 50.2
## 10 1948 TRUE 45.9
## # ℹ 34 more rows
Then, let’s plot that! Pipe the result of your group_by and summarize
to ggplot and geom_line(), with year on the x axis and your summarized
value on the y axis. Color by the south variable.
votedata %>% group_by(year, south) %>% summarise(averagevotes = mean(demVote)) %>% ggplot(aes(x = year, y = averagevotes, color = south)) + geom_line()
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
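The message above is informational: after summarise(), the result is still grouped by year. If you would rather get an ungrouped result and no message, one option is the .groups argument. A minimal sketch on hypothetical toy data (toy and avg are made-up names, with the same column names as votedata):

```r
library(dplyr)

# Hypothetical toy data with the same column names as votedata
toy <- tibble(
  year    = c(2000, 2000, 2000, 2004, 2004, 2004),
  south   = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  demVote = c(40, 50, 60, 42, 52, 58)
)

# .groups = "drop" returns an ungrouped tibble and silences the message
avg <- toy %>%
  group_by(year, south) %>%
  summarise(averagevotes = mean(demVote), .groups = "drop")

avg
```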
Penguins!
library(palmerpenguins)
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
We can use boxplots to visualize the distribution of weight (body_mass_g) within each species:
penguins %>% ggplot(aes(x = species, y = body_mass_g)) + geom_boxplot()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
What if we also want the points? Layering!! Add a geom_point to your existing boxplot. geom_boxplot + geom_point!
penguins %>% ggplot(aes(x = species, y = body_mass_g)) + geom_boxplot() + geom_point()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
But these are all stacked up… To actually see them, use geom_jitter instead of geom_point.
penguins %>% ggplot(aes(x = species, y = body_mass_g)) + geom_boxplot() + geom_jitter()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
How do we get the boxplots on top? Layers are drawn in the order you give them, so put the points first: geom_jitter + geom_boxplot. (You might want to lower the alpha on the boxplot so you can see the points underneath.)
penguins %>% ggplot(aes(x = species, y = body_mass_g)) + geom_jitter() + geom_boxplot(alpha = .7)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
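One thing to watch when layering jittered points with a boxplot: the boxplot draws its own outlier points, so each outlier appears twice. A possible tweak is sketched below on simulated data (toy, species, and mass are made up for illustration). It hides the boxplot's outliers and restricts the jitter to the horizontal direction so the mass values are not moved.

```r
library(ggplot2)

# Simulated stand-in for the penguin data (an assumption, not the real data)
set.seed(42)
toy <- data.frame(
  species = rep(c("A", "B"), each = 30),
  mass    = c(rnorm(30, 3700, 400), rnorm(30, 5000, 500))
)

p_layers <- ggplot(toy, aes(x = species, y = mass)) +
  geom_jitter(width = 0.2, height = 0) +           # spread points sideways only
  geom_boxplot(alpha = 0.7, outlier.shape = NA)    # hide duplicate outlier points

p_layers
```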
Maybe let’s try replacing the boxplot with a
geom_violin()?
penguins %>% ggplot(aes(x = species, y = body_mass_g)) + geom_jitter() + geom_violin(alpha = .7)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
penguins %>% group_by(species, sex) %>% summarise(n = n()) %>% pivot_wider(names_from = sex, values_from = n, names_prefix = "n_")
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
## # A tibble: 3 × 4
## # Groups: species [3]
## species n_female n_male n_NA
## <fct> <int> <int> <int>
## 1 Adelie 73 73 6
## 2 Chinstrap 34 34 NA
## 3 Gentoo 58 61 5
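The NA under n_NA for Chinstrap appears because that species has no rows with a missing sex, so pivot_wider() has nothing to put in that cell. If a zero reads better, the values_fill argument can supply it. A sketch on hypothetical toy data (toy and counts are made-up names), also using count() as a shorthand for group_by() plus summarise(n = n()):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: species B has no missing sex values
toy <- tibble(
  species = c("A", "A", "A", "B", "B"),
  sex     = c("female", "male", NA, "female", "male")
)

counts <- toy %>%
  count(species, sex) %>%
  pivot_wider(names_from = sex, values_from = n,
              names_prefix = "n_", values_fill = 0L)  # fill empty cells with 0

counts
```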
penguins %>% group_by(island) %>% summarise(avg_mass = mean(body_mass_g, na.rm=TRUE)) %>% pivot_wider(names_from = island, values_from = avg_mass, names_prefix = "avg_mass_")
## # A tibble: 1 × 3
## avg_mass_Biscoe avg_mass_Dream avg_mass_Torgersen
## <dbl> <dbl> <dbl>
## 1 4716. 3713. 3706.
penguins %>% group_by(sex) %>% summarise(avg_bill_len = mean(bill_length_mm, na.rm=TRUE)) %>% pivot_wider(names_from = sex, values_from = avg_bill_len, names_prefix = "avg_bill_len_")
## # A tibble: 1 × 3
## avg_bill_len_female avg_bill_len_male avg_bill_len_NA
## <dbl> <dbl> <dbl>
## 1 42.1 45.9 41.3
penguins %>% filter(sex == "female") %>% ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) + geom_point() + labs(title = "Female Penguin Bills", x = 'Bill Length (mm)', y = 'Bill Depth (mm)') + theme_bw()
penguins %>% ggplot(aes(x = bill_length_mm, y = body_mass_g, color = species)) + geom_point() + labs(title = "Bill Stats by Island", x = 'Bill Length (mm)', y = 'Body Mass (g)', color = 'Species') + facet_wrap(~ island) + theme_bw()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
penguins %>% ggplot(aes(x = flipper_length_mm, color = sex)) + geom_density() + labs(title = "Flipper Length by Sex", x = 'Flipper Length (mm)', color = 'Sex') + theme_bw()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).
penguins %>% ggplot(aes(x = body_mass_g, color = year)) + geom_density() + labs(title = "Body Mass by Year", x = 'Body Mass (g)', color = 'Year') + theme_bw()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).
## Warning: The following aesthetics were dropped during statistical transformation:
## colour.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
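The last warning appears because year is numeric, so ggplot treats color as a continuous scale and cannot split the density into groups. Converting the variable to a factor, as the warning suggests, gives one curve per year. A sketch on simulated data (toy, mass, and p_year are made up for illustration):

```r
library(ggplot2)

# Simulated stand-in for the penguin data (an assumption, not the real data)
set.seed(7)
toy <- data.frame(
  year = rep(c(2007L, 2008L, 2009L), each = 50),
  mass = rnorm(150, mean = 4200, sd = 800)
)

# factor(year) makes the color scale discrete, so each year gets its own curve
p_year <- ggplot(toy, aes(x = mass, color = factor(year))) +
  geom_density() +
  labs(x = "Body Mass (g)", color = "Year")

p_year
```

The same change, color = factor(year), should work in the penguins call above.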